Scripted vs Spontaneous Speech#

In this section, we explore the differences between scripted and spontaneous (casual) speech. The vocal variations between these two speaking styles are subtle yet impactful: speaking style has been shown to affect voice perception in humans for unfamiliar voices (Smith et al. (2019), Stevenage et al. (2021) and Afshan et al. (2022)). Accordingly, we investigate the effect of speaking style on speech embeddings, which should maintain close distances between samples from the same speaker.

Research Questions:#

  1. Is there a noticeable within-speaker difference between scripted and spontaneous speech utterances?

  2. Would the difference change depending on the type of feature extractor used?

  3. Is this difference maintained in lower dimensions?

Dataset Description:#

The dataset used in this experiment is obtained from here. We compiled speech utterances from 26 speakers (14 female and 12 male). The collected dataset comprises 7 tasks (4 scripted / 3 spontaneous).

Tasks:

  1. NWS (script): Reading ‘The North Wind and Sun’ passage

  2. LPP (script): Reading ‘The Little Prince’ sentences

  3. DHR (script): Reading ‘Declaration of Human Rights’ sentences

  4. HT2 (script): Reading ‘Hearing in Noise Test 2’ sentences

  5. QNA (spon): Answering questions ‘Q and A session’

  6. ST1 (spon): Telling a personal story 1

  7. ST2 (spon): Telling a personal story 2

The dataset was preprocessed by downsampling to 16 kHz to be compatible with the BYOL-S models. Additionally, the utterances were cropped to fixed durations (1, 3, 5, 10 and 15 sec), yielding 5 new datasets generated from the original one.
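The resample-and-chunk step is handled by the repo's `deciphering_enigma` helpers; purely as an illustration of the idea, here is a minimal sketch using scipy on a synthetic tone (the sample rates and 3-second chunk length mirror the description above):

```python
import numpy as np
from scipy.signal import resample_poly

orig_sr, target_sr = 44100, 16000
t = np.arange(orig_sr * 7) / orig_sr
y = np.sin(2 * np.pi * 220 * t)               # 7-second, 44.1 kHz test tone
y16k = resample_poly(y, target_sr, orig_sr)   # downsample to 16 kHz
chunk = 3 * target_sr                         # 3-second fixed-duration crops
chunks = [y16k[i:i + chunk] for i in range(0, len(y16k) - chunk + 1, chunk)]
# 7 s of audio yields two full 3 s chunks; the trailing remainder is dropped
```
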

Finally, the naming convention for the audio files is: {ID}_{Gender}_{Task}_{Label}_{File Number}.wav (e.g. 049_F_DHR_script_000.wav).
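As a quick illustration, the convention can be parsed back into metadata fields with a toy helper (not part of the repo, which does this inside `extract_metadata`):

```python
def parse_filename(fname):
    """Split a '{ID}_{Gender}_{Task}_{Label}_{File Number}.wav' name into fields."""
    stem = fname.rsplit('.', 1)[0]          # drop the extension
    spk_id, gender, task, label, file_num = stem.split('_')
    return {'ID': spk_id, 'Gender': gender, 'Task': task,
            'Label': label, 'File': file_num}

print(parse_filename('049_F_DHR_script_000.wav'))
# {'ID': '049', 'Gender': 'F', 'Task': 'DHR', 'Label': 'script', 'File': '000'}
```
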

In the following analysis, we will be using the 3sec-utterance version of the dataset.

1) Loading Data#

import deciphering_enigma

#define the experiment config file path
path_to_config = './config.yaml'

#read the experiment config file
exp_config = deciphering_enigma.load_yaml_config(path_to_config)
dataset_path = exp_config.dataset_path

#register experiment directory and read wav files' paths
audio_files = deciphering_enigma.build_experiment(exp_config)
print(f'Dataset has {len(audio_files)} samples')
Dataset has 6471 samples
#extract metadata from the file name convention first, so speaker IDs and the audio format are available below
metadata_df, audio_format = deciphering_enigma.extract_metadata(exp_config, audio_files)

#preprocess audio files (resample and chunk) if enabled in the config
if exp_config.preprocess_data:
    dataset_path = deciphering_enigma.preprocess_audio_files(audio_files, speaker_ids=metadata_df['ID'], chunk_dur=exp_config.chunk_dur, resampling_rate=exp_config.resampling_rate, 
                    save_path=f'{exp_config.dataset_name}_{exp_config.model_name}/preprocessed_audios', audio_format=audio_format)

#balance data to have an equal number of labels per speaker
audio_files = deciphering_enigma.balance_data()
print(f'After Balancing labels: Dataset has {len(audio_files)} samples')

#load audio files as torch tensors to get ready for feature extraction
audio_tensor_list = deciphering_enigma.load_dataset(audio_files, cfg=exp_config, speaker_ids=metadata_df['ID'], audio_format=audio_format)
After Balancing labels: Dataset has 5816 samples
Audio Tensors are already saved for scriptvsspon_speech

2) Generating Embeddings#

We are generating speech embeddings from 9 different models (BYOL-A, BYOL-S/CNN, BYOL-S/CvT, Hybrid BYOL-S/CNN, Hybrid BYOL-S/CvT, TRILLsson, Wav2Vec2, HuBERT and Data2Vec).

#generate speech embeddings
embeddings_dict = deciphering_enigma.extract_models(audio_tensor_list, exp_config)
Load BYOL-A_default Model
BYOL-A_default embeddings are already saved for scriptvsspon_speech
(5816, 2048)
Load BYOL-S_default Model
BYOL-S_default embeddings are already saved for scriptvsspon_speech
(5816, 2048)
Load Hybrid_BYOL-S_default Model
Hybrid_BYOL-S_default embeddings are already saved for scriptvsspon_speech
(5816, 2048)
Load BYOL-S_cvt Model
BYOL-S_cvt embeddings are already saved for scriptvsspon_speech
(5816, 2048)
Load Hybrid_BYOL-S_cvt Model
Hybrid_BYOL-S_cvt embeddings are already saved for scriptvsspon_speech
(5816, 2048)
Load TRILLsson Model
TRILLsson embeddings are already saved for scriptvsspon_speech
(5816, 1024)
Load Wav2Vec2 Model
Wav2Vec2 embeddings are already saved for scriptvsspon_speech
(5816, 1024)
Load HuBERT Model
HuBERT embeddings are already saved for scriptvsspon_speech
(5816, 1280)
Load Data2Vec Model
Data2Vec embeddings are already saved for scriptvsspon_speech
(5816, 1024)

3) Original Dimension Analysis#

3.1. Distance-based#

Compute distances (e.g. cosine distance) across embeddings of utterances. The steps are:

  1. Compute pairwise distances across all 5816 samples (a 5816×5816 matrix).

  2. Convert the pairwise (square) form to long form, i.e. three columns [Sample_1, Sample_2, Distance], yielding a dataframe 5816×5816 rows long.

  3. Remove rows with zero distance (i.e. distances between a sample and itself).

  4. Keep only the distances between samples from the same speaker and the same label (e.g. Dist{speaker1_Label1_audio0 –> speaker1_Label1_audio1}), as shown in the figure below.

  5. Remove duplicates, i.e. the distance 0 –> 1 == 1 –> 0.

  6. Standardize distances within each speaker to account for within-speaker variability.

  7. Remove distances above the 99th percentile (outliers).

  8. Plot a violin plot for each model, split by label, to see how these models encode both labels.
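The steps above can be sketched end-to-end on toy data (random embeddings and made-up filenames; the real analysis runs on the 5816 model embeddings):

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

files = ['049_F_DHR_script_000.wav', '049_F_NWS_script_001.wav',
         '049_F_HT2_script_002.wav', '049_F_QNA_spon_000.wav',
         '049_F_ST1_spon_001.wav', '050_M_QNA_spon_000.wav']
emb = np.random.default_rng(0).normal(size=(len(files), 8))

# steps 1-2: pairwise square form -> long form
pairwise = pd.DataFrame(squareform(pdist(emb, metric='cosine')),
                        index=files, columns=files)
long_form = pairwise.unstack()
long_form.index.rename(['Sample_1', 'Sample_2'], inplace=True)
long_form = long_form.to_frame('Distance').reset_index()
# steps 3 and 5: drop self-distances and keep each unordered pair once
long_form = long_form.loc[long_form['Sample_1'] < long_form['Sample_2']]
# step 4: keep same-speaker, same-label pairs only
spk = lambda s: s.str.split('_').str[0]
lab = lambda s: s.str.split('_').str[3]
df = long_form.loc[(spk(long_form['Sample_1']) == spk(long_form['Sample_2'])) &
                   (lab(long_form['Sample_1']) == lab(long_form['Sample_2']))].copy()
# step 6: standardize within each speaker
df['ID'] = spk(df['Sample_1'])
df['Distance'] = df.groupby('ID')['Distance'].transform(lambda x: (x - x.mean()) / x.std())
# step 7: drop outliers above the 99th percentile
df = df.loc[df['Distance'] <= df['Distance'].quantile(0.99)]
```

With these six toy files, four same-speaker/same-label pairs survive step 4, and the percentile filter in step 7 removes the single largest standardized distance.
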

distance

df_all = deciphering_enigma.compute_distances(metadata_df, embeddings_dict, exp_config.dataset_name, 'cosine', list(metadata_df.columns))
DF for the cosine distances using BYOL-A_default already exist!
DF for the cosine distances using BYOL-S_default already exist!
DF for the cosine distances using Hybrid_BYOL-S_default already exist!
DF for the cosine distances using BYOL-S_cvt already exist!
DF for the cosine distances using Hybrid_BYOL-S_cvt already exist!
DF for the cosine distances using TRILLsson already exist!
DF for the cosine distances using Wav2Vec2 already exist!
DF for the cosine distances using HuBERT already exist!
DF for the cosine distances using Data2Vec already exist!
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

def cohend(d1, d2):
    """Cohen's d effect size between two distance distributions."""
    n1, n2 = len(d1), len(d2)
    s1, s2 = np.var(d1, ddof=1), np.var(d2, ddof=1)
    s = np.sqrt(((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2))
    return (np.mean(d1) - np.mean(d2)) / s

def visualize_violin_dist(df_all):
    fig, ax = plt.subplots(1, 1, figsize=(30, 10))
    violin = sns.violinplot(data=df_all, x='Model', y='Distance', inner='quartile', hue='Label_1', split=True, ax=ax)
    ax.tick_params(axis='both', labelsize=15)
    ax.set_ylabel('Standardized Cosine Distances', fontsize=20)
    ax.set_xlabel('Models', fontsize=20)

    # statistical annotation: bracket each model's pair of violins with its Cohen's d
    y, h, col = df_all['Distance'].max()*1.05, df_all['Distance'].max()*0.01, 'k'
    for i, model_name in enumerate(df_all['Model'].unique()):
        d = cohend(df_all['Distance'].loc[(df_all.Label_1=='spon') & (df_all.Model==model_name)],
                   df_all['Distance'].loc[(df_all.Label_1=='script') & (df_all.Model==model_name)])
        x1, x2 = -0.25+i, 0.25+i
        ax.plot([x1, x1, x2, x2], [y, y+h, y+h, y], lw=1.5, c=col)
        ax.text((x1+x2)*.5, y+(h*1.5), f'cohen d={d:.2}', ha='center', va='bottom', color=col, fontsize=15)
    violin.legend(fontsize=15,
                  bbox_to_anchor=(1, 1),
                  title="Labels",
                  title_fontsize=18,
                  shadow=True,
                  facecolor='white')
    plt.tight_layout()
    plt.savefig('violin_dist.png')
visualize_violin_dist(df_all)
../../_images/scriptvsspon_analysis_19_1.png

3.2. Similarity Representation Analysis:#
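CKA (centered kernel alignment) scores the similarity between two models' representations of the same samples, from 0 (unrelated) to 1 (identical up to rotation and isotropic scaling). The cell below uses the repo's unbiased RBF-kernel implementation; as a self-contained illustration of the idea, here is a minimal linear-CKA sketch (not the repo's class):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two feature matrices (n_samples x n_dims)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, 'fro') ** 2
    norm_x = np.linalg.norm(X.T @ X, 'fro')
    norm_y = np.linalg.norm(Y.T @ Y, 'fro')
    return hsic / (norm_x * norm_y)

X = np.random.default_rng(0).normal(size=(100, 16))
print(linear_cka(X, X))        # ~1.0: identical representations
print(linear_cka(X, 3.0 * X))  # ~1.0: invariant to isotropic scaling
```
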

import numpy as np
from tqdm import tqdm
cka_class = deciphering_enigma.CKA(unbiased=True, kernel='rbf', rbf_threshold=0.5)
num_models = len(embeddings_dict.keys())
cka_ = np.zeros((num_models, num_models))
print(cka_.shape)
for i, (_, model_1) in enumerate(tqdm(embeddings_dict.items())):
    for j, (_, model_2) in enumerate(embeddings_dict.items()):
        cka_[i,j] = cka_class.compute(model_1, model_2)
  0%|                                                                                                                                           | 0/9 [00:00<?, ?it/s]
(9, 9)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [04:51<00:00, 32.36s/it]
cka_class.plot_heatmap(cka_, embeddings_dict.keys(), save_path=f'{exp_config.dataset_name}', save_fig=True)
../../_images/scriptvsspon_analysis_22_0.png

4) Dimensionality Reduction#

The previous analysis showed how well each model groups the utterances of the same speaker under both conditions (scripted and spontaneous) in the high-dimensional embedding space. Next, we replicate the same analysis in a lower-dimensional space to visualize the impact of speaking style on voice identity perception.

Accordingly, we will utilize several dimensionality reduction methods, namely PCA, tSNE, UMAP and PaCMAP, to get a better idea of how the speakers’ samples cluster in 2D. However, one constraint is that these methods (except PCA) are sensitive to their hyperparameters, which could impact our interpretation of the results. Thus, a grid search across the hyperparameters of each method is implemented.

Another issue is quantifying how well these methods preserve the distances among samples in the high dimension when projecting them to a lower dimension. To address this, we use two metrics, KNN and CPD, that represent the ability of the algorithm to preserve the local and global structure of the original embedding space, respectively. Both metrics are adopted from this paper, which defines them as follows:

  • KNN: The fraction of k-nearest neighbours in the original high-dimensional data that are preserved as k-nearest neighbours in the embedding. KNN quantifies preservation of the local, or microscopic, structure. The value of k used here is the minimum number of samples a speaker has in the original space.

  • CPD: Spearman correlation between pairwise distances in the high-dimensional space and in the embedding. CPD quantifies preservation of the global, or macroscopic, structure. It is computed across all pairs among 1000 points chosen randomly with replacement.

Consequently, we present the results of the dimensionality reduction methods in two ways: one optimizing the local structure metric (KNN) and the other optimizing the global structure metric (CPD).
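Under these definitions, both metrics can be sketched as follows (toy data; the k value and the 1000-point subsample from the description above are shrunk for illustration):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

def knn_preservation(X, Z, k=10):
    """Mean fraction of each point's k nearest neighbours preserved in the embedding."""
    # kneighbors() with no query returns each training point's neighbours, excluding itself
    nn_x = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(return_distance=False)
    nn_z = NearestNeighbors(n_neighbors=k).fit(Z).kneighbors(return_distance=False)
    shared = [len(set(a) & set(b)) for a, b in zip(nn_x, nn_z)]
    return float(np.mean(shared)) / k

def cpd(X, Z, n_points=100, seed=0):
    """Spearman correlation of pairwise distances across the two spaces."""
    idx = np.random.default_rng(seed).choice(len(X), size=n_points, replace=True)
    rho, _ = spearmanr(pdist(X[idx]), pdist(Z[idx]))
    return rho

X = np.random.default_rng(0).normal(size=(300, 50))   # stand-in for embeddings
Z = PCA(n_components=2).fit_transform(X)              # stand-in for a 2D reduction
print(f'KNN: {knn_preservation(X, Z):.2f}, CPD: {cpd(X, Z):.2f}')
```
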

4.1 Mapping Labels#

tuner = deciphering_enigma.ReducerTuner()
for i, model_name in enumerate(embeddings_dict.keys()):
    tuner.tune_reducer(embeddings_dict[model_name], metadata=metadata_df, dataset_name=exp_config.dataset_name, model_name=model_name)
Tuned Reduced Embeddings already saved for BYOL-A_default model!
Tuned Reduced Embeddings already saved for BYOL-S_default model!
Tuned Reduced Embeddings already saved for Hybrid_BYOL-S_default model!
Tuned Reduced Embeddings already saved for BYOL-S_cvt model!
Tuned Reduced Embeddings already saved for Hybrid_BYOL-S_cvt model!
Tuned Reduced Embeddings already saved for TRILLsson model!
Tuned Reduced Embeddings already saved for Wav2Vec2 model!
Tuned Reduced Embeddings already saved for HuBERT model!
Tuned Reduced Embeddings already saved for Data2Vec model!
import seaborn as sns
import plotly.express as px
def visualize_embeddings(df, label_name, metrics=[], axis=[], acoustic_param={}, opt_structure='Local', plot_type='sns', red_name='PCA', row=1, col=1, hovertext='', label='spon'):
    if plot_type == 'sns':
        if label_name == 'Gender':
            sns.scatterplot(data=df, x=(red_name, opt_structure, 'Dim1'), y=(red_name, opt_structure, 'Dim2'), hue=label_name, palette='deep', ax=axis)
        else:
            sns.scatterplot(data=df, x=(red_name, opt_structure, 'Dim1'), y=(red_name, opt_structure, 'Dim2'), hue=label_name
                            , style=label_name, palette='deep', ax=axis)
        axis.set(xlabel=None, ylabel=None)
        axis.get_legend().remove()
    elif plot_type == 'plotly':
        traces = px.scatter(x=df[red_name, opt_structure, 'Dim1'], y=df[red_name, opt_structure, 'Dim2'], color=df[label_name].astype(str), hover_name=hovertext)
        traces.layout.update(showlegend=False)
        axis.add_traces(
            list(traces.select_traces()),
            rows=row, cols=col
        )
    else:
        points = axis.scatter(df[red_name, opt_structure, 'Dim1'], df[red_name, opt_structure, 'Dim2'],
                     c=df[label_name], s=20, cmap="Spectral")
        return points

4.1.1. Mapping Gender#

import matplotlib.pyplot as plt
import pandas as pd
fig, ax = plt.subplots(9, 4, figsize=(40, 90))
optimize = 'Global'
reducer_names = ['PCA', 'tSNE', 'UMAP', 'PaCMAP']
for i, model_name in enumerate(embeddings_dict.keys()):
    df = pd.read_csv(f'../{exp_config.dataset_name}/{model_name}/dim_reduction.csv', header=[0,1,2])
    df.rename(columns={'Unnamed: 17_level_1': '', 'Unnamed: 17_level_2': '', 'Unnamed: 18_level_1': '', 'Unnamed: 18_level_2': '', 'Unnamed: 19_level_1': '', 'Unnamed: 19_level_2': '', 'Unnamed: 20_level_1': '', 'Unnamed: 20_level_2': '',
                       'Unnamed: 21_level_1': '', 'Unnamed: 21_level_2': '',},inplace=True)
    for j, name in enumerate(reducer_names):
        ax[0,j].set_title(f'{name}', fontsize=25)
        visualize_embeddings(df, 'Gender', metrics=[], axis=ax[i, j], opt_structure=optimize, red_name=name, plot_type='sns')
    ax[i, 0].set_ylabel(model_name, fontsize=25)
ax[0,j].legend(bbox_to_anchor=(1, 1.15), fontsize=20)
plt.tight_layout()
../../_images/scriptvsspon_analysis_32_0.png

4.1.2. Mapping Identity#

fig, ax = plt.subplots(9, 4, figsize=(40, 90))
optimize = 'Global'
reducer_names = ['PCA', 'tSNE', 'UMAP', 'PaCMAP']
for i, model_name in enumerate(embeddings_dict.keys()):
    df = pd.read_csv(f'../{exp_config.dataset_name}/{model_name}/dim_reduction.csv', header=[0,1,2])
    df.rename(columns={'Unnamed: 17_level_1': '', 'Unnamed: 17_level_2': '', 'Unnamed: 18_level_1': '', 'Unnamed: 18_level_2': '', 'Unnamed: 19_level_1': '', 'Unnamed: 19_level_2': '', 'Unnamed: 20_level_1': '', 'Unnamed: 20_level_2': '',
                       'Unnamed: 21_level_1': '', 'Unnamed: 21_level_2': '',},inplace=True)
    for j, name in enumerate(reducer_names):
        ax[0,j].set_title(f'{name}', fontsize=25)
        visualize_embeddings(df, 'ID', metrics=[], axis=ax[i, j], opt_structure=optimize, red_name=name, plot_type='sns')
    ax[i, 0].set_ylabel(model_name, fontsize=25)
plt.tight_layout()
../../_images/scriptvsspon_analysis_34_0.png

4.1.3. Mapping Speaking Style (Script/Spon)#

fig, ax = plt.subplots(9, 4, figsize=(40, 90))
optimize = 'Global'
reducer_names = ['PCA', 'tSNE', 'UMAP', 'PaCMAP']
for i, model_name in enumerate(embeddings_dict.keys()):
    df = pd.read_csv(f'../{exp_config.dataset_name}/{model_name}/dim_reduction.csv', header=[0,1,2])
    df.rename(columns={'Unnamed: 17_level_1': '', 'Unnamed: 17_level_2': '', 'Unnamed: 18_level_1': '', 'Unnamed: 18_level_2': '', 'Unnamed: 19_level_1': '', 'Unnamed: 19_level_2': '', 'Unnamed: 20_level_1': '', 'Unnamed: 20_level_2': '',
                       'Unnamed: 21_level_1': '', 'Unnamed: 21_level_2': '',},inplace=True)
    for j, name in enumerate(reducer_names):
        ax[0,j].set_title(f'{name}', fontsize=25)
        visualize_embeddings(df, 'Label', metrics=[], axis=ax[i, j], opt_structure=optimize, red_name=name, plot_type='sns')
    ax[i, 0].set_ylabel(model_name, fontsize=25)
ax[0,j].legend(bbox_to_anchor=(1, 1.15), fontsize=20)
plt.tight_layout()
../../_images/scriptvsspon_analysis_36_0.png

4.2 Distance in Lower Dimensions#

labels = ['script', 'spon']
dfs = []
for label in labels:
    df = pd.read_csv(f'{label}_dataset.csv', header=[0,1,2])
    df.rename(columns={'Unnamed: 17_level_1': '', 'Unnamed: 17_level_2': '', 'Unnamed: 18_level_1': '', 'Unnamed: 18_level_2': '', 'Unnamed: 19_level_1': '', 'Unnamed: 19_level_2': ''},inplace=True)
    pacmap_global_df = df.loc[:, ('PaCMAP', 'Global')]
    pacmap_global_df['wav_file'] = df['wav_file']; pacmap_global_df['label'] = label
    dfs.append(pacmap_global_df)
df = pd.concat(dfs, axis=0)
df.sample(10)
Dim1 Dim2 wav_file label
948 -16.733345 -1.996756 058_F_HT2_script_001.wav script
1696 -7.589085 -6.348401 065_F_DHR_script_010.wav script
2734 -13.431674 -4.555548 132_M_QNA_spon_049.wav spon
1765 -7.870420 -6.359378 065_F_LPP_script_016.wav script
1128 5.016654 5.860515 059_M_LPP_script_022.wav script
1121 4.929832 5.776471 059_M_LPP_script_015.wav script
428 16.376358 3.609564 052_M_NWS_script_002.wav script
2000 5.581617 -12.904393 067_F_QNA_spon_083.wav spon
1281 -1.567527 11.817277 061_M_QNA_spon_029.wav spon
2062 -12.941590 -5.509135 068_F_HT2_script_004.wav script
#create distance-based dataframe between all data samples in a square form
from scipy.spatial.distance import pdist, squareform
pairwise = pd.DataFrame(
    squareform(pdist(df.iloc[:, :2], metric='cosine')),
    columns = df['wav_file'],
    index = df['wav_file']
)
#move from square form DF to long form DF
long_form = pairwise.unstack()
#rename columns and turn into a dataframe
long_form.index.rename(['Sample_1', 'Sample_2'], inplace=True)
long_form = long_form.to_frame('Distance').reset_index()
#remove the distances computed between same samples (distance = 0)
long_form = long_form.loc[long_form['Sample_1'] != long_form['Sample_2']]
long_form.sample(10)
Sample_1 Sample_2 Distance
29460525 069_F_QNA_spon_032.wav 072_F_DHR_script_025.wav 0.726305
16264862 133_M_DHR_script_015.wav 052_M_ST2_spon_022.wav 1.053295
18851113 052_M_QNA_spon_012.wav 063_F_DHR_script_001.wav 0.662966
26444879 064_F_QNA_spon_073.wav 071_F_QNA_spon_095.wav 0.012889
28205636 067_F_QNA_spon_024.wav 058_F_QNA_spon_038.wav 0.259614
4276696 056_F_HT2_script_048.wav 067_F_DHR_script_019.wav 0.183844
28975777 068_F_QNA_spon_048.wav 053_M_DHR_script_029.wav 0.395303
19771496 053_M_QNA_spon_055.wav 049_F_QNA_spon_004.wav 1.996565
11123129 066_M_NWS_script_008.wav 049_F_QNA_spon_029.wav 0.980920
23103239 059_M_QNA_spon_025.wav 068_F_LPP_script_002.wav 1.238760
#add columns for meta-data
long_form['Gender'] = long_form.apply(lambda row: row['Sample_1'].split('_')[1] if row['Sample_1'].split('_')[1] == row['Sample_2'].split('_')[1] else 'Different', axis=1)
long_form['Label'] = long_form.apply(lambda row: row['Sample_1'].split('_')[3] if row['Sample_1'].split('_')[3] == row['Sample_2'].split('_')[3] else 'Different', axis=1)
long_form['ID'] = long_form.apply(lambda row: row['Sample_1'].split('_')[0] if row['Sample_1'].split('_')[0] == row['Sample_2'].split('_')[0] else 'Different', axis=1)
long_form.sample(10)
Sample_1 Sample_2 Distance Gender Label ID
11829677 068_F_DHR_script_007.wav 133_M_QNA_spon_060.wav 0.050603 Different Different Different
2004548 052_M_DHR_script_023.wav 058_F_QNA_spon_030.wav 0.672418 Different Different Different
32927936 132_M_QNA_spon_068.wav 056_F_QNA_spon_008.wav 1.999721 Different spon Different
32933531 132_M_QNA_spon_069.wav 052_M_ST2_spon_035.wav 0.223480 M spon Different
11423077 067_F_HT2_script_013.wav 053_M_DHR_script_017.wav 1.999219 Different script Different
22509950 058_F_QNA_spon_056.wav 068_F_DHR_script_004.wav 0.411826 F Different Different
2862502 053_M_HT2_script_025.wav 058_F_NWS_script_003.wav 1.976306 Different script Different
29512461 069_F_QNA_spon_041.wav 068_F_HT2_script_019.wav 1.277754 F Different Different
7549556 061_M_HT2_script_016.wav 052_M_LPP_script_000.wav 0.004711 M script Different
6979211 060_F_HT2_script_017.wav 049_F_DHR_script_011.wav 0.610212 F script Different
#remove distances computed between different speakers and different labels
df = long_form.loc[(long_form['Gender']!='Different') & (long_form['Label']!='Different') & (long_form['ID']!='Different')]
df.sample(10)
Sample_1 Sample_2 Distance Gender Label ID
10517184 066_M_DHR_script_015.wav 066_M_HT2_script_024.wav 0.000549 M script 066
9173513 064_F_DHR_script_012.wav 064_F_NWS_script_007.wav 0.106249 F script 064
17997754 050_M_ST1_spon_004.wav 050_M_QNA_spon_044.wav 0.000895 M spon 050
5630860 058_F_HT2_script_021.wav 058_F_HT2_script_025.wav 0.000425 F script 058
25967078 063_F_QNA_spon_100.wav 063_F_QNA_spon_090.wav 0.002420 F spon 063
31720064 072_F_QNA_spon_085.wav 072_F_QNA_spon_048.wav 0.036492 F spon 072
2536298 053_M_DHR_script_000.wav 053_M_LPP_script_025.wav 0.001557 M script 053
4409184 056_F_LPP_script_012.wav 056_F_DHR_script_012.wav 0.001254 F script 056
19603322 053_M_QNA_spon_026.wav 053_M_QNA_spon_058.wav 0.350624 M spon 053
16561024 133_M_HT2_script_027.wav 133_M_LPP_script_014.wav 0.000339 M script 133
fig, ax = plt.subplots(1, 1, figsize=(10, 8))
sns.violinplot(data=df, x='Label', y='Distance', inner='quartile', ax=ax)
ax.set_xlabel('Labels', fontsize=15)
ax.set_ylabel('Cosine Distances', fontsize=15)

# statistical annotation
d=cohend(df['Distance'].loc[(df.Label=='spon')], df['Distance'].loc[(df.Label=='script')])
x1, x2 = 0, 1
y, h, col = df['Distance'].max() + 0.05, 0.01, 'k'
plt.plot([x1, x1, x2, x2], [y, y+h, y+h, y], lw=1.5, c=col)
plt.text((x1+x2)*.5, y+(h*1.5), f'cohen d={d:.2}', ha='center', va='bottom', color=col)

plt.tight_layout()
../../_images/scriptvsspon_analysis_43_0.png

5) Identity Prediction from Scripted vs Spontaneous speech#

Here, we assess the ability of speech embeddings generated from scripted and spontaneous samples to predict speaker identity, and compare the performance of both.

#split train and test samples for each participant
spon_df = df.loc[df.Label=='spon']
script_df = df.loc[df.Label=='script']
spon_train=[]; spon_test = []
script_train=[]; script_test = []
for speaker in df['Speaker_ID'].unique():
    speaker_spon_df = spon_df.loc[spon_df.Speaker_ID == speaker]
    speaker_script_df = script_df.loc[script_df.Speaker_ID == speaker]
    #draw a separate 70/30 split mask per subset (each mask must match its subset's length)
    spon_msk = np.random.rand(len(speaker_spon_df)) < 0.7
    script_msk = np.random.rand(len(speaker_script_df)) < 0.7
    spon_train.append(speaker_spon_df[spon_msk])
    spon_test.append(speaker_spon_df[~spon_msk])
    script_train.append(speaker_script_df[script_msk])
    script_test.append(speaker_script_df[~script_msk])
train_spon_df = pd.concat(spon_train)
test_spon_df = pd.concat(spon_test)
train_script_df = pd.concat(script_train)
test_script_df = pd.concat(script_test)
train_spon_features = train_spon_df.iloc[:, 4:]
train_spon_labels = train_spon_df['Speaker_ID']
test_spon_features = test_spon_df.iloc[:, 4:]
test_spon_labels = test_spon_df['Speaker_ID']
train_script_features = train_script_df.iloc[:, 4:]
train_script_labels = train_script_df['Speaker_ID']
test_script_features = test_script_df.iloc[:, 4:]
test_script_labels = test_script_df['Speaker_ID']

5.1 Identity prediction from spontaneous samples#

from sklearn.model_selection import RepeatedStratifiedKFold, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

clf_names, clfs, params_clf = get_sklearn_models()
grid_results = {}
for i, (clf_name, clf, clf_params) in enumerate(zip(clf_names, clfs, params_clf)):
    print(f'Step {i+1}/{len(clf_names)}: {clf_name}...')    
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=_RANDOM_SEED)
    pipeline = Pipeline([('transformer', StandardScaler()), ('estimator', clf)])
    grid_search = GridSearchCV(pipeline, param_grid=clf_params, n_jobs=-1, cv=cv, scoring='recall_macro', error_score=0)
    grid_result = grid_search.fit(train_spon_features, train_spon_labels)
    grid_results[clf_name] = grid_result
    test_result = grid_result.score(test_spon_features, test_spon_labels)
    print(f'Best {clf_name} UAR: {grid_result.best_score_*100: .2f} using {grid_result.best_params_}')
    print(f'  Test Data UAR: {test_result*100: .2f}')
Step 1/3: LR...
Best LR UAR:  99.00 using {'estimator__C': 100.0, 'estimator__class_weight': None}
  Test Data UAR:  99.06
Step 2/3: RF...
Best RF UAR:  92.84 using {'estimator__class_weight': 'balanced', 'estimator__max_depth': 25, 'estimator__min_samples_split': 2}
  Test Data UAR:  91.20
Step 3/3: SVC...
Best SVC UAR:  98.43 using {'estimator__C': 100000.0, 'estimator__class_weight': 'balanced', 'estimator__kernel': 'linear'}
  Test Data UAR:  98.42

5.2 Identity prediction from scripted samples#

clf_names, clfs, params_clf = get_sklearn_models()
grid_results = {}
for i, (clf_name, clf, clf_params) in enumerate(zip(clf_names, clfs, params_clf)):
    print(f'Step {i+1}/{len(clf_names)}: {clf_name}...')    
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=_RANDOM_SEED)
    pipeline = Pipeline([('transformer', StandardScaler()), ('estimator', clf)])
    grid_search = GridSearchCV(pipeline, param_grid=clf_params, n_jobs=-1, cv=cv, scoring='recall_macro', error_score=0)
    grid_result = grid_search.fit(train_script_features, train_script_labels)
    grid_results[clf_name] = grid_result
    test_result = grid_result.score(test_script_features, test_script_labels)
    print(f'Best {clf_name} UAR: {grid_result.best_score_*100: .2f} using {grid_result.best_params_}')
    print(f'  Test Data UAR: {test_result*100: .2f}')
Step 1/3: LR...
Best LR UAR:  99.59 using {'estimator__C': 100.0, 'estimator__class_weight': 'balanced'}
  Test Data UAR:  99.17
Step 2/3: RF...
Best RF UAR:  95.61 using {'estimator__class_weight': None, 'estimator__max_depth': 25, 'estimator__min_samples_split': 5}
  Test Data UAR:  96.51
Step 3/3: SVC...
Best SVC UAR:  99.23 using {'estimator__C': 100000.0, 'estimator__class_weight': 'balanced', 'estimator__kernel': 'linear'}
  Test Data UAR:  99.53

6) Gender Features in BYOL-S#

The dimensionality reduction plots show that the models separate gender clearly. Accordingly, we will identify the main BYOL-S features that drive gender prediction and remove them, to see whether the BYOL-S representation still maintains gender separation or instead sheds light on a different kind of acoustic variation.#

Methodology:#

  1. Train 3 classifiers (Logistic Regression ‘LR’, Random Forest ‘RF’ and Support Vector Classifier ‘SVC’) to predict gender from BYOL-S embeddings.

  2. Select the top important features in gender prediction for each trained model.

  3. Extract the common features across the 3 classifiers.

  4. Remove these features from the extracted embeddings and apply dimensionality reduction to observe changes.

Model Training: The training process consists of running 5-fold CV on standardized inputs and reporting the best recall score.#

6.1 Train Classifiers#

#binarize the gender label from the metadata
gender = metadata_df['Gender']
gender_binary = pd.get_dummies(gender)
gender_binary = gender_binary.values
gender_binary = gender_binary.argmax(1)

#define classifiers' objects and fit dataset
clf_names, clfs, params_clf = get_sklearn_models()
grid_results = {}
for i, (clf_name, clf, clf_params) in enumerate(zip(clf_names, clfs, params_clf)):
    print(f'Step {i+1}/{len(clf_names)}: {clf_name}...')    
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=_RANDOM_SEED)
    pipeline = Pipeline([('transformer', StandardScaler()), ('estimator', clf)])
    grid_search = GridSearchCV(pipeline, param_grid=clf_params, n_jobs=-1, cv=cv, scoring='recall_macro', error_score=0)
    grid_result = grid_search.fit(byols_embeddings, gender_binary)
    grid_results[clf_name] = grid_result
    print(f'Best {clf_name} UAR: {grid_result.best_score_*100: .2f} using {grid_result.best_params_}')
Step 1/3: LR...
Best LR UAR:  99.98 using {'estimator__C': 100000.0, 'estimator__class_weight': 'balanced'}
Step 2/3: RF...
Best RF UAR:  99.13 using {'estimator__class_weight': None, 'estimator__max_depth': 20, 'estimator__min_samples_split': 10}
Step 3/3: SVC...
Best SVC UAR:  99.96 using {'estimator__C': 0.001, 'estimator__class_weight': None, 'estimator__kernel': 'linear'}

6.2 Select the important features for gender prediction#

#select top k features from all classifiers
from functools import reduce
features = []; k=500
for clf_name in clf_names:
    features_df = eval_features_importance(clf_name, grid_results[clf_name])
    features.append(features_df.index[:k])
#get common features among selected top features
indices = reduce(np.intersect1d, (features[0], features[1], features[2]))
#create one array containing only the common top features (gender features) and another one containing the rest (genderless features)
gender_embeddings = byols_embeddings[:, indices]
genderless_embeddings = np.delete(byols_embeddings, indices, axis=1)
Extract important features from LR model:
Extract important features from RF model:
Extract important features from SVC model:
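`eval_features_importance` is a repo helper; conceptually, the per-classifier importance ranking and intersection it feeds into look like the following self-contained sketch (synthetic data, with feature 3 deliberately made informative; the classifier set mirrors LR/RF/SVC above):

```python
import numpy as np
from functools import reduce
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 50))
y = rng.integers(0, 2, size=400)
X[:, 3] += 3.0 * y          # feature 3 strongly encodes the label

k = 10
tops = []
for clf in (LogisticRegression(max_iter=5000),
            RandomForestClassifier(random_state=0),
            LinearSVC(max_iter=5000)):
    clf.fit(X, y)
    # linear models expose coefficients; tree ensembles expose impurity importances
    imp = np.abs(clf.coef_[0]) if hasattr(clf, 'coef_') else clf.feature_importances_
    tops.append(np.argsort(imp)[::-1][:k])

common = reduce(np.intersect1d, tops)  # features ranked top-k by all three models
print(common)
```
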

7) Acoustic Features Analysis in BYOL-S#

In this section, we will compute some acoustic features (F0 and loudness) from the audio files and see their distribution in the 2D dimensionality reduction plots.

import numpy as np
import librosa
import soundfile as sf
import pyloudnorm as pyln
from tqdm import tqdm

f0s = []; loudness = []; mfcc_1 = []; rms = []
for file in tqdm(wav_files):
    audio, orig_sr = sf.read(file)

#     #measure the median fundamental frequency
#     f0 = librosa.yin(audio, fmin=librosa.note_to_hz('C1'),
#                             fmax=librosa.note_to_hz('C7'), sr=orig_sr)
#     f0s.append(np.nanmedian(f0))

#     #measure the loudness
#     meter = pyln.Meter(orig_sr) # create BS.1770 meter
#     l = meter.integrated_loudness(audio)
#     loudness.append(l)

#     #measure the first mfcc
#     mfccs = librosa.feature.mfcc(y=audio, sr=orig_sr)
#     mfcc_1.append(np.nanmedian(mfccs[0,:]))

    #measure rms
    rms.append(np.nanmedian(librosa.feature.rms(y=audio)))
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5816/5816 [00:11<00:00, 519.77it/s]
import pickle
with open("rms.pickle", "wb") as output_file:
    pickle.dump(rms, output_file)
with open("f0s.pickle", "rb") as input_file:
    f0s = np.array(pickle.load(input_file))
with open("loudness.pickle", "rb") as input_file:
    loudness = np.array(pickle.load(input_file))
with open("mfcc_1.pickle", "rb") as input_file:
    mfcc_1 = np.array(pickle.load(input_file))
with open("rms.pickle", "rb") as input_file:
    rms = np.array(pickle.load(input_file))

Plotting the Median F0 of audio samples across 4 dimensionality reduction methods

fig, ax = plt.subplots(2, 4, figsize=(30, 15))
optimize = 'Global'
unique_labels = ['script', 'spon']
metrics = pd.read_csv('scriptvsspon_metrics.csv')
reducer_names, params_list = get_reducers_params()
for i, label in enumerate(unique_labels):
    indices = list(np.where(labels == label)[0])
    df = pd.read_csv(f'{label}_dataset.csv', header=[0,1,2])
    df['f0'] = f0s[indices]; df['loudness'] = loudness[indices]
    df['f0'] = df['f0'].mask(df['f0'] > 300, 300)
    df.rename(columns={'Unnamed: 17_level_1': '', 'Unnamed: 17_level_2': '', 'Unnamed: 18_level_1': '', 'Unnamed: 18_level_2': '', 'Unnamed: 19_level_1': '', 'Unnamed: 19_level_2': ''},inplace=True)
    for j, name in enumerate(reducer_names):
        max_idx = metrics[optimize].loc[(metrics.Protocol==label)&(metrics.Method==name)].idxmax()
        metric = [metrics['Local'].iloc[max_idx], metrics['Global'].iloc[max_idx]]
        points = visualize_embeddings(df, 'f0', metrics=metric, axis=ax[i, j], opt_structure=optimize, red_name=name, plot_type='colorbar')
    ax[i, 0].set_ylabel(label, fontsize=15)
cbar = fig.colorbar(points, ax=ax.ravel().tolist())
cbar.ax.set_ylabel('Median F0', rotation=270)
plt.show()
../../_images/scriptvsspon_analysis_74_0.png
fig = make_subplots(rows=2, cols=4)
optimize = 'Global'
unique_labels = ['script', 'spon']
metrics = pd.read_csv('scriptvsspon_metrics.csv')
reducer_names, params_list = get_reducers_params()
for i, label in enumerate(unique_labels):
    df = pd.read_csv(f'{label}_dataset.csv', header=[0,1,2])
    indices = list(np.where(labels == label)[0])
    df['f0'] = f0s[indices]; df['loudness'] = loudness[indices]
    df['f0'] = df['f0'].mask(df['f0'] > 300, 300)
    df.rename(columns={'Unnamed: 17_level_1': '', 'Unnamed: 17_level_2': '', 'Unnamed: 18_level_1': '', 'Unnamed: 18_level_2': '', 'Unnamed: 19_level_1': '', 'Unnamed: 19_level_2': ''},inplace=True)
    for j, name in enumerate(reducer_names):
        max_idx = metrics[optimize].loc[(metrics.Protocol==label)&(metrics.Method==name)].idxmax()
        metric = [metrics['Local'].iloc[max_idx], metrics['Global'].iloc[max_idx]]
        visualize_embeddings(df, 'f0', metrics=metric, axis=fig, opt_structure=optimize, red_name=name, plot_type='plotly', row=i+1, col=j+1, hovertext=df['wav_file'], label=label)
fig.update_layout(
    autosize=False,
    width=1600,
    height=1200, showlegend=False,)
fig.show()

Plotting the Loudness of audio samples across 4 dimensionality reduction methods

fig, ax = plt.subplots(2, 4, figsize=(30, 15))
optimize = 'Global'
unique_labels = ['script', 'spon']
metrics = pd.read_csv('scriptvsspon_metrics.csv')
reducer_names, params_list = get_reducers_params()
for i, label in enumerate(unique_labels):
    indices = list(np.where(labels == label)[0])
    df = pd.read_csv(f'{label}_dataset.csv', header=[0,1,2])
    df['f0'] = f0s[indices]; df['loudness'] = loudness[indices]
    df['f0'] = df['f0'].mask(df['f0'] > 300, 300)
    df.rename(columns={'Unnamed: 17_level_1': '', 'Unnamed: 17_level_2': '', 'Unnamed: 18_level_1': '', 'Unnamed: 18_level_2': '', 'Unnamed: 19_level_1': '', 'Unnamed: 19_level_2': ''},inplace=True)
    for j, name in enumerate(reducer_names):
        max_idx = metrics[optimize].loc[(metrics.Protocol==label)&(metrics.Method==name)].idxmax()
        metric = [metrics['Local'].iloc[max_idx], metrics['Global'].iloc[max_idx]]
        points = visualize_embeddings(df, 'loudness', metrics=metric, axis=ax[i, j], opt_structure=optimize, red_name=name, plot_type='colorbar')
    ax[i, 0].set_ylabel(label, fontsize=15)
cbar = fig.colorbar(points, ax=ax.ravel().tolist())
cbar.ax.set_ylabel('Loudness', rotation=270)
plt.show()
../../_images/scriptvsspon_analysis_77_0.png
fig = make_subplots(rows=2, cols=4)
optimize = 'Global'
unique_labels = ['script', 'spon']
metrics = pd.read_csv('scriptvsspon_metrics.csv')
reducer_names, params_list = get_reducers_params()
for i, label in enumerate(unique_labels):
    df = pd.read_csv(f'{label}_dataset.csv', header=[0,1,2])
    indices = list(np.where(labels == label)[0])
    df['f0'] = f0s[indices]; df['loudness'] = loudness[indices]
    df['f0'] = df['f0'].mask(df['f0'] > 300, 300)
    df.rename(columns={'Unnamed: 17_level_1': '', 'Unnamed: 17_level_2': '', 'Unnamed: 18_level_1': '', 'Unnamed: 18_level_2': '', 'Unnamed: 19_level_1': '', 'Unnamed: 19_level_2': ''},inplace=True)
    for j, name in enumerate(reducer_names):
        max_idx = metrics[optimize].loc[(metrics.Protocol==label)&(metrics.Method==name)].idxmax()
        metric = [metrics['Local'].iloc[max_idx], metrics['Global'].iloc[max_idx]]
        visualize_embeddings(df, 'loudness', metrics=metric, axis=fig, opt_structure=optimize, red_name=name, plot_type='plotly', row=i+1, col=j+1, hovertext=df['wav_file'], label=label)
fig.update_layout(
    autosize=False,
    width=1600,
    height=1200, showlegend=False,)
fig.show()

Plotting the median of the first MFCC coefficient of audio samples across 4 dimensionality reduction methods

fig, ax = plt.subplots(2, 4, figsize=(30, 15))
optimize = 'Global'
unique_labels = ['script', 'spon']
metrics = pd.read_csv('scriptvsspon_metrics.csv')
reducer_names, params_list = get_reducers_params()
for i, label in enumerate(unique_labels):
    indices = list(np.where(labels == label)[0])
    df = pd.read_csv(f'{label}_dataset.csv', header=[0,1,2])
    df['f0'] = f0s[indices]; df['loudness'] = loudness[indices]; df['mfcc_1'] = mfcc_1[indices]
    df['f0'] = df['f0'].mask(df['f0'] > 300, 300)
    df.rename(columns={'Unnamed: 17_level_1': '', 'Unnamed: 17_level_2': '', 'Unnamed: 18_level_1': '', 'Unnamed: 18_level_2': '', 'Unnamed: 19_level_1': '', 'Unnamed: 19_level_2': ''},inplace=True)
    for j, name in enumerate(reducer_names):
        max_idx = metrics[optimize].loc[(metrics.Protocol==label)&(metrics.Method==name)].idxmax()
        metric = [metrics['Local'].iloc[max_idx], metrics['Global'].iloc[max_idx]]
        points = visualize_embeddings(df, 'mfcc_1', metrics=metric, axis=ax[i, j], opt_structure=optimize, red_name=name, plot_type='colorbar')
    ax[i, 0].set_ylabel(label, fontsize=15)
cbar = fig.colorbar(points, ax=ax.ravel().tolist())
cbar.ax.set_ylabel('Median MFCC 1', rotation=270)
plt.show()
../../_images/scriptvsspon_analysis_80_0.png
fig = make_subplots(rows=2, cols=4)
optimize = 'Global'
unique_labels = ['script', 'spon']
metrics = pd.read_csv('scriptvsspon_metrics.csv')
reducer_names, params_list = get_reducers_params()
for i, label in enumerate(unique_labels):
    df = pd.read_csv(f'{label}_dataset.csv', header=[0,1,2])
    indices = list(np.where(labels == label)[0])
    df['f0'] = f0s[indices]; df['loudness'] = loudness[indices]; df['mfcc_1'] = mfcc_1[indices]; df['rms'] = rms[indices]
    df['f0'] = df['f0'].mask(df['f0'] > 300, 300)
    df.rename(columns={'Unnamed: 17_level_1': '', 'Unnamed: 17_level_2': '', 'Unnamed: 18_level_1': '', 'Unnamed: 18_level_2': '', 'Unnamed: 19_level_1': '', 'Unnamed: 19_level_2': ''},inplace=True)
    for j, name in enumerate(reducer_names):
        max_idx = metrics[optimize].loc[(metrics.Protocol==label)&(metrics.Method==name)].idxmax()
        metric = [metrics['Local'].iloc[max_idx], metrics['Global'].iloc[max_idx]]
        visualize_embeddings(df, 'mfcc_1', metrics=metric, axis=fig, opt_structure=optimize, red_name=name, plot_type='plotly', row=i+1, col=j+1, hovertext=df['wav_file'], label=label)
fig.update_layout(
    autosize=False,
    width=1600,
    height=1200, showlegend=False,)
fig.show()

Plotting the median RMS of audio samples across 4 dimensionality reduction methods

fig, ax = plt.subplots(2, 4, figsize=(30, 15))
optimize = 'Global'
unique_labels = ['script', 'spon']
metrics = pd.read_csv('scriptvsspon_metrics.csv')
reducer_names, params_list = get_reducers_params()
for i, label in enumerate(unique_labels):
    indices = list(np.where(labels == label)[0])
    df = pd.read_csv(f'{label}_dataset.csv', header=[0,1,2])
    df['rms'] = rms[indices]
    df.rename(columns={'Unnamed: 17_level_1': '', 'Unnamed: 17_level_2': '', 'Unnamed: 18_level_1': '', 'Unnamed: 18_level_2': '', 'Unnamed: 19_level_1': '', 'Unnamed: 19_level_2': ''},inplace=True)
    for j, name in enumerate(reducer_names):
        max_idx = metrics[optimize].loc[(metrics.Protocol==label)&(metrics.Method==name)].idxmax()
        metric = [metrics['Local'].iloc[max_idx], metrics['Global'].iloc[max_idx]]
        points = visualize_embeddings(df, 'rms', metrics=metric, axis=ax[i, j], opt_structure=optimize, red_name=name, plot_type='colorbar')
    ax[i, 0].set_ylabel(label, fontsize=15)
cbar = fig.colorbar(points, ax=ax.ravel().tolist())
cbar.ax.set_ylabel('Median RMS', rotation=270)
plt.show()
../../_images/scriptvsspon_analysis_83_0.png
fig = make_subplots(rows=2, cols=4)
optimize = 'Global'
unique_labels = ['script', 'spon']
metrics = pd.read_csv('scriptvsspon_metrics.csv')
reducer_names, params_list = get_reducers_params()
for i, label in enumerate(unique_labels):
    df = pd.read_csv(f'{label}_dataset.csv', header=[0,1,2])
    indices = list(np.where(labels == label)[0])
    df['rms'] = rms[indices]
    df.rename(columns={'Unnamed: 17_level_1': '', 'Unnamed: 17_level_2': '', 'Unnamed: 18_level_1': '', 'Unnamed: 18_level_2': '', 'Unnamed: 19_level_1': '', 'Unnamed: 19_level_2': ''},inplace=True)
    for j, name in enumerate(reducer_names):
        max_idx = metrics[optimize].loc[(metrics.Protocol==label)&(metrics.Method==name)].idxmax()
        metric = [metrics['Local'].iloc[max_idx], metrics['Global'].iloc[max_idx]]
        visualize_embeddings(df, 'rms', metrics=metric, axis=fig, opt_structure=optimize, red_name=name, plot_type='plotly', row=i+1, col=j+1, hovertext=df['wav_file'], label=label)
fig.update_layout(
    autosize=False,
    width=1600,
    height=1200, showlegend=False,)
fig.show()